31 results for Data mining

in QUB Research Portal - Research Directory and Institutional Repository for Queen's University Belfast


Relevance:

100.00%

Publisher:

Abstract:

In the last decade, data mining has emerged as one of the most dynamic and lively areas in information technology. Although many algorithms and techniques for data mining have been proposed, they focus either on domain-independent techniques or on very specific domain problems. A general requirement in bridging the gap between academia and business is to cater to general domain-related issues surrounding real-life applications, such as constraints, organizational factors, domain expert knowledge, domain adaptation, and operational knowledge. Unfortunately, these either have not been addressed, or have not been sufficiently addressed, in current data mining research and development. Domain-Driven Data Mining (D3M) aims to develop general principles, methodologies, and techniques for modeling and merging comprehensive domain-related factors and synthesized ubiquitous intelligence surrounding problem domains with the data mining process, and discovering knowledge to support business decision-making. This paper aims to report original, cutting-edge, and state-of-the-art progress in D3M. It covers theoretical and applied contributions aiming to: 1) propose next-generation data mining frameworks and processes for actionable knowledge discovery, 2) investigate effective (automated, human- and machine-centered, and/or human-machine-cooperated) principles and approaches for acquiring, representing, modelling, and engaging ubiquitous intelligence in real-world data mining, and 3) develop workable and operational systems that balance technical significance and application concerns, and that convert and deliver actionable knowledge into operational application rules to seamlessly engage application processes and systems.

Relevance:

100.00%

Publisher:

Abstract:

Background. The assembly of the tree of life has seen significant progress in recent years but algae and protists have been largely overlooked in this effort. Many groups of algae and protists have ancient roots and it is unclear how much data will be required to resolve their phylogenetic relationships for incorporation in the tree of life. The red algae, a group of primary photosynthetic eukaryotes more than a billion years old, provide the earliest fossil evidence for eukaryotic multicellularity and sexual reproduction. Despite this evolutionary significance, their phylogenetic relationships are understudied. This study aims to infer a comprehensive red algal tree of life at the family level from a supermatrix containing data mined from GenBank. We aim to locate remaining regions of low support in the topology, evaluate their causes and estimate the amount of data required to resolve them. Results. Phylogenetic analysis of a supermatrix of 14 loci and 98 red algal families yielded the most complete red algal tree of life to date. Visualization of statistical support showed the presence of five poorly supported regions. Causes for low support were identified with statistics about the age of the region, data availability and node density, showing that poor support has different origins in different parts of the tree. Parametric simulation experiments yielded optimistic estimates of how much data will be needed to resolve the poorly supported regions (ca. 10^3 to ca. 10^4 nucleotides for the different regions). Nonparametric simulations gave a markedly more pessimistic picture, with some regions requiring more than 2.8 × 10^5 nucleotides or not achieving the desired level of support at all. The discrepancies between parametric and nonparametric simulations are discussed in light of our dataset and known attributes of both approaches. Conclusions. Our study takes the red algae one step closer to meaningful inclusion in the tree of life. In addition to the recovery of stable relationships, the recognition of five regions in need of further study is a significant outcome of this work. Based on our analyses of current availability and future requirements of data, we make clear recommendations for forthcoming research.

Relevance:

100.00%

Publisher:

Abstract:

We conducted data-mining analyses of genome-wide association (GWA) studies of the CATIE and MGS-GAIN datasets, and found 13 markers in the two physically linked genes, PTPN21 and EML5, showing nominally significant association with schizophrenia. Linkage disequilibrium (LD) analysis indicated that all 7 markers from PTPN21 shared high LD (r^2 > 0.8), including rs2274736 and rs2401751, the two non-synonymous markers with the most significant association signals (rs2401751, P = 1.10 × 10^-3 and rs2274736, P = 1.21 × 10^-3). In a meta-analysis of all 13 replication datasets with a total of 13,940 subjects, we found that the two non-synonymous markers are significantly associated with schizophrenia (rs2274736, OR = 0.92, 95% CI: 0.86-0.97, P = 5.45 × 10^-3 and rs2401751, OR = 0.92, 95% CI: 0.86-0.97, P = 5.29 × 10^-3). One SNP (rs7147796) in EML5 is also significantly associated with the disease (OR = 1.08, 95% CI: 1.02-1.14, P = 6.43 × 10^-3). These 3 markers remain significant after Bonferroni correction. Furthermore, haplotype-conditioned analyses indicated that the association signals observed between rs2274736/rs2401751 and rs7147796 are statistically independent. Given that 2 non-synonymous markers in PTPN21 are associated with schizophrenia, further investigation of this locus is warranted.

Relevance:

70.00%

Publisher:

Abstract:

The last decade has witnessed unprecedented growth in the availability of data with spatio-temporal characteristics. Given the scale and richness of such data, finding spatio-temporal patterns that behave significantly differently from their neighbors could be of interest in various application scenarios, such as weather modeling, analyzing the spread of disease outbreaks, and monitoring traffic congestion. In this paper, we propose an automated approach for exploring and discovering such anomalous patterns irrespective of the underlying domain from which the data is recovered. Our approach differs significantly from traditional methods of spatial outlier detection, and employs two phases: i) discovering homogeneous regions, and ii) evaluating these regions as anomalies based on their statistical difference from a generalized neighborhood. We evaluate the quality of our approach and distinguish it from existing techniques via an extensive experimental evaluation.
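
The two phases named in this abstract can be illustrated with a minimal sketch. It is not the authors' method: the grid representation, the flood-fill grouping rule, the z-score test, and the parameter names (`tol`, `z_threshold`, `max_fraction`) are all assumptions chosen to keep the example small.

```python
from collections import deque
from statistics import mean, pstdev

STEPS = ((1, 0), (-1, 0), (0, 1), (0, -1))  # 4-connected neighbourhood


def homogeneous_regions(grid, tol):
    """Phase 1 (assumed rule): flood-fill adjacent cells whose values differ by <= tol."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c]:
                continue
            seen[r][c] = True
            cells, queue = [], deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                cells.append((y, x))
                for dy, dx in STEPS:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols and not seen[ny][nx]
                            and abs(grid[ny][nx] - grid[y][x]) <= tol):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(cells)
    return regions


def anomalous_regions(grid, regions, z_threshold=3.0, max_fraction=0.5):
    """Phase 2 (assumed test): z-score of the region mean against its bordering cells.

    Regions larger than max_fraction of the grid are treated as background
    and skipped; that cut-off is a hypothetical simplification.
    """
    rows, cols = len(grid), len(grid[0])
    flagged = []
    for cells in regions:
        if len(cells) > max_fraction * rows * cols:
            continue
        inside = set(cells)
        border = set()
        for y, x in cells:
            for dy, dx in STEPS:
                ny, nx = y + dy, x + dx
                if 0 <= ny < rows and 0 <= nx < cols and (ny, nx) not in inside:
                    border.add((ny, nx))
        if len(border) < 2:
            continue
        outside = [grid[y][x] for y, x in border]
        spread = max(pstdev(outside), 1e-9)  # floor avoids division by zero
        z = (mean(grid[y][x] for y, x in cells) - mean(outside)) / spread
        if abs(z) > z_threshold:
            flagged.append((cells, z))
    return flagged
```

Calling `anomalous_regions(grid, homogeneous_regions(grid, tol))` returns each flagged region with its score. A real system would need a more careful notion of "generalized neighborhood" than the one-cell border used here.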

Relevance:

70.00%

Publisher:

Abstract:

The problem of detecting spatially coherent groups of data that exhibit anomalous behavior has started to attract attention due to applications across areas such as epidemic analysis and weather forecasting. Earlier efforts from the data mining community have largely focused on finding outliers, individual data objects that display deviant behavior. Such point-based methods are not easy to extend to find groups of data that exhibit anomalous behavior. Scan statistics are methods from the statistics community that address the problem of identifying regions where data objects exhibit behavior that is atypical of the general dataset. The spatial scan statistic and methods that build upon it mostly adopt the framework of defining a character for regions of objects (e.g., circular or elliptical) and repeatedly sampling regions of that character, followed by applying a statistical test for anomaly detection. In the past decade, there have been efforts from the statistics community to enhance the efficiency of scan statistics as well as to enable discovery of arbitrarily shaped anomalous regions. On the other hand, the data mining community has started to look at determining anomalous regions whose behavior diverges from that of their neighborhood. In this chapter, we survey the space of techniques for detecting anomalous regions in spatial data from across the data mining and statistics communities while outlining connections to well-studied problems in clustering and image segmentation. We analyze the techniques systematically by categorizing them appropriately to provide a structured bird's-eye view of the work on anomalous region detection; we hope that this will encourage better cross-pollination of ideas across communities to help advance the frontier in anomaly detection.
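
As an illustration of the scan-statistic framework described in this abstract, the sketch below scores circular regions with a Kulldorff-style Poisson log-likelihood ratio and keeps the best one. The input layout `(x, y, cases, population)`, the candidate radii, and the choice to centre circles on data points are assumptions made for the example. The significance test (usually Monte Carlo replication) is omitted.

```python
import math


def poisson_llr(c, e, total):
    """Log-likelihood ratio for a zone with c observed and e expected cases.

    Returns 0 unless the zone shows an excess of cases (c > e).
    """
    if c <= e or e <= 0 or e >= total:
        return 0.0
    inside = c * math.log(c / e)
    # When every case lies in the zone, the outside term vanishes.
    outside = (total - c) * math.log((total - c) / (total - e)) if c < total else 0.0
    return inside + outside


def best_circular_zone(points, radii):
    """points: list of (x, y, cases, population) tuples.

    Scans circles centred on each point for each radius and returns
    (llr, centre, radius, member_indices) for the highest-scoring circle.
    """
    total_cases = sum(p[2] for p in points)
    total_pop = sum(p[3] for p in points)
    best = (0.0, None, None, [])
    for cx, cy, _, _ in points:
        for radius in radii:
            members = [i for i, (x, y, _, _) in enumerate(points)
                       if math.hypot(x - cx, y - cy) <= radius]
            observed = sum(points[i][2] for i in members)
            # Expected cases are proportional to population under the null.
            expected = total_cases * sum(points[i][3] for i in members) / total_pop
            llr = poisson_llr(observed, expected, total_cases)
            if llr > best[0]:
                best = (llr, (cx, cy), radius, members)
    return best
```

Because only circles are scanned, a cluster with an irregular shape is captured together with some ordinary neighbours. That limitation is what motivates the arbitrarily shaped methods the abstract mentions.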

Relevance:

70.00%

Publisher:

Abstract:

Association rule mining is an indispensable tool for discovering insights from large databases and data warehouses. Because the data in a warehouse is multi-dimensional, it is often useful to mine rules over subsets of data defined by selections over the dimensions. Such interactive rule mining over multi-dimensional query windows is difficult since rule mining is computationally expensive. Current methods using pre-computation of frequent itemsets require counting of some itemsets by revisiting the transaction database at query time, which is very expensive. We develop a method (RMW) that identifies the minimal set of itemsets to compute and store for each cell, so that rule mining over any query window may be performed without going back to the transaction database. We give formal proofs that the set of itemsets chosen by RMW is sufficient to answer any query and also prove that it is the optimal set to be computed for one-dimensional queries. We demonstrate through an extensive empirical evaluation that RMW achieves extremely fast query response time compared to existing methods, with only moderate overhead in pre-computation and storage.
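
The idea of answering a query window from per-cell pre-computed counts can be sketched as follows. This is a naive baseline, not RMW: it stores counts for every itemset up to `max_size` in each cell, whereas the abstract's contribution is choosing a minimal set to store. The cell keys, function names, and thresholds are hypothetical.

```python
from collections import Counter
from itertools import combinations


def precompute(cells, max_size=2):
    """cells: dict mapping a cell key to a list of transactions (sets of items).

    Stores, per cell, the transaction count and the support count of every
    itemset with at most max_size items.
    """
    store = {}
    for key, transactions in cells.items():
        counts = Counter()
        for transaction in transactions:
            for k in range(1, max_size + 1):
                counts.update(combinations(sorted(transaction), k))
        store[key] = (len(transactions), counts)
    return store


def rules_for_window(store, window, min_support, min_confidence):
    """Mine rules (antecedent -> consequent) over the cells in `window`.

    Only the stored counts are used; the transactions are never revisited.
    """
    n, counts = 0, Counter()
    for key in window:
        size, cell_counts = store[key]
        n += size
        counts.update(cell_counts)  # support counts add up across cells
    rules = []
    for itemset, count in counts.items():
        if len(itemset) < 2 or count / n < min_support:
            continue
        for consequent in itemset:
            antecedent = tuple(i for i in itemset if i != consequent)
            confidence = count / counts[antecedent]
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, count / n, confidence))
    return rules
```

Summing counts works because cells partition the transactions, so an itemset's support in a window is the sum of its supports in the window's cells. The cost is storage: an itemset that is infrequent in one cell may still be frequent over a window, which is why a naive store has to keep far more than per-cell frequent itemsets.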

Relevance:

60.00%

Publisher:

Abstract:

Data identification is a key task for any Internet Service Provider (ISP) or network administrator. As port fluctuation and encryption become more common in P2P traffic wishing to avoid identification, new strategies must be developed to detect and classify such flows. This paper introduces a new method of separating P2P and standard web traffic that can be applied as part of a data mining process, based on the activity of the hosts on the network. Unlike other research, our method is aimed at classifying individual flows rather than just identifying P2P hosts or ports. Heuristics are analysed and a classification system is proposed. The accuracy of the system is then tested using real network traffic from a core internet router, showing over 99% accuracy in some cases. We expand on this proposed strategy to investigate its application to real-time, early classification problems. New proposals are made and the results of real-time experiments are compared to those obtained in the data mining research. To the best of our knowledge, this is the first research to use host-based flow identification to determine a flow's application within the early stages of the connection.

Relevance:

60.00%

Publisher:

Abstract:

Many of the challenges faced in health care delivery can be informed through building models. In particular, Discrete Conditional Survival (DCS) models, recently under development, can provide policymakers with a flexible tool to assess time-to-event data. The DCS model can represent the survival curve using various underlying distribution types and can cluster or group observations (based on other covariate information) external to the distribution fits. The flexibility of the model comes from the choice of data mining techniques available for ascertaining the different subsets and from the choice of distribution types available for modelling these informed subsets. This paper presents an illustrated example of the Discrete Conditional Survival model being deployed to represent ambulance response times by a fully parameterised model. This model is contrasted with a parametric accelerated failure-time model, illustrating the strength and usefulness of Discrete Conditional Survival models.

Relevance:

60.00%

Publisher:

Abstract:

Discrete Conditional Phase-type (DC-Ph) models are a family of models which represent skewed survival data conditioned on specific inter-related discrete variables. The survival data is modeled using a Coxian phase-type distribution which is associated with the inter-related variables using a range of possible data mining approaches such as Bayesian networks (BNs), the Naïve Bayes classification method, and classification and regression trees. This paper uses the DC-Ph model to explore the modeling of patient waiting times in an Accident and Emergency Department of a UK hospital. The resulting DC-Ph model takes the form of the Coxian phase-type distribution conditioned on the outcome of a logistic regression model.
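
For readers unfamiliar with the Coxian phase-type distribution named in this abstract, its standard textbook form is given below. The symbols are not taken from the abstract: here λ_i is the rate of moving from phase i to phase i+1, μ_i is the rate of leaving the system from phase i, and n is the number of phases.

```latex
% n-phase Coxian phase-type distribution (standard form; notation assumed).
% The process starts in phase 1, moves only forward, and may exit from any phase.
\[
\mathbf{Q} =
\begin{pmatrix}
-(\lambda_1+\mu_1) & \lambda_1 & & \\
 & -(\lambda_2+\mu_2) & \ddots & \\
 & & \ddots & \lambda_{n-1} \\
 & & & -\mu_n
\end{pmatrix},
\qquad
\mathbf{p} = (1, 0, \dots, 0),
\qquad
\mathbf{q} = (\mu_1, \dots, \mu_n)^{\top}
\]
% Density and survival function of the time to absorption T:
\[
f(t) = \mathbf{p}\, e^{\mathbf{Q}t}\, \mathbf{q},
\qquad
S(t) = \Pr(T > t) = \mathbf{p}\, e^{\mathbf{Q}t}\, \mathbf{1}
\]
```

With n = 1 this reduces to the exponential distribution with rate μ_1. One plausible reading of "conditioned on the outcome" is that a separate parameter set (λ, μ) is fitted for each class predicted by the discrete model, but the abstract does not spell this out.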

Relevance:

60.00%

Publisher:

Abstract:

The skin of fish is the first line of defense against pathogens and parasites. The skin transcriptome of the Atlantic salmon is poorly characterized, and currently only 2,089 expressed sequence tags (ESTs) out of a total of half a million sequences are generated from skin-derived cDNA libraries. The primary aim of this study was to enhance the transcriptomic knowledge of salmon skin by using next-generation sequencing (NGS) technology, namely the Roche-454 platform. An equimolar mixture of high-quality RNA from skin and epidermal samples of salmon reared in either freshwater or seawater was used for 454-sequencing. This technique yielded over 600,000 reads, which were assembled into 34,696 isotigs using Newbler. Of these isotigs, 12% had not been sequenced in Atlantic salmon, hence representing previously unreported salmon mRNAs that can potentially be skin-specific. Many full-length genes have been acquired, representing numerous biological processes. Mucin proteins are the main structural component of mucus, and we examined in greater detail the sequences we obtained for these genes. Several isotigs exhibited homology to mammalian mucins (MUC2, MUC5AC and MUC5B). Mucin mRNAs are generally > 10 kbp and contain large repetitive units, which pose a challenge to full-length sequence discovery. To date, we have not unearthed any full-length salmon mucin genes with this dataset, but have both N- and C-terminal regions of a mucin type 5. This highlights the fact that, while NGS is indeed a formidable tool for sequence data mining of non-model species, it must be complemented with additional experimental and bioinformatic work to characterize some mRNA sequences with complex features.